A rapid and reference-free imputation method for low-cost genotyping platforms

Most current genotype imputation methods are reference-based, which poses several challenges to users, such as high computational costs and inaccessible reference panels. Deep learning models are therefore expected to enable reference-free imputation methods with higher accuracy and shorter running times. We propose an imputation method, GRUD, that uses recurrent neural networks integrated with an additional discriminator network. The method was applied to datasets from genotyping chips and Low-Pass Whole Genome Sequencing (LP-WGS), with reference panels from the 1000 Genomes Project (1KGP) phase 3, a dataset of 4810 Singaporeans (SG10K), and the 1000 Vietnamese Genome Project (VN1K). Our model performed more accurately than other existing methods on multiple datasets, especially for common variants with large minor allele frequencies, while reducing running time and memory usage. In summary, these results indicate that GRUD can be implemented in genomic analyses to improve the accuracy and running time of genotype imputation.

Publicly available datasets, along with in-house and simulated datasets from SNP microarray chips and low-pass sequencing, were used to assess the performance of GRUD in comparison with other existing techniques.

Evaluation with 1KGP datasets
Infinium Omni2.5-8 BeadChip. Among the different settings of the generator's loss function, the squared Pearson correlation between allele dosages and true genotype counts (R²), calculated on MAF bins above 0.005, showed a similar pattern for the Equivalent and Lower models, both significantly higher than that of the Higher model. Hence, the Equivalent model (GRUD model, in short) was chosen for evaluation against the other existing methods (Fig. 1A).
We used Minimac4 and Beagle5.4 as existing imputation techniques based on the Li and Stephens model, together with the deep learning model of Kojima, to benchmark the imputation performance of our proposed model. The comparison of R² values, obtained as the squared correlation between WGS data and allele dosages of imputed genotypes, is displayed in Fig. 1B. For variants with MAF > 0.05, the imputation accuracy of GRUD was slightly higher than that of Kojima and significantly higher than those of Beagle5.4 and Minimac4. For variants with MAF ≤ 0.05, the R² of our proposed model was higher than or comparable to those of Beagle5.4 and Minimac4, but lower than that of Kojima. Overall, the GRUD method (average R² = 0.819, overall R² = 0.948) performed comparably with Kojima (average R² = 0.831, overall R² = 0.946) (Table 1).
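To make this metric concrete, the following is a minimal sketch (our illustration, not the authors' evaluation code; the bin edges and function names are assumptions) of the per-variant R² computation and its aggregation by MAF bin:

```python
import numpy as np

def per_variant_r2(true_genotypes, imputed_dosages):
    """Per-variant squared Pearson correlation (R^2).

    true_genotypes, imputed_dosages: arrays of shape (n_variants, n_samples),
    with genotypes coded as alternative-allele counts (0, 1, 2) and dosages
    as continuous values in [0, 2]."""
    r2 = np.full(true_genotypes.shape[0], np.nan)
    for i in range(true_genotypes.shape[0]):
        g, d = true_genotypes[i], imputed_dosages[i]
        if g.std() == 0 or d.std() == 0:  # monomorphic site: correlation undefined
            continue
        r2[i] = np.corrcoef(g, d)[0, 1] ** 2
    return r2

def mean_r2_by_maf(r2, maf, edges=(0.0, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5)):
    """Average the per-variant R^2 within each MAF bin (bin edges illustrative)."""
    idx = np.digitize(maf, edges) - 1
    return {f"({edges[b]}, {edges[b + 1]}]": float(np.nanmean(r2[idx == b]))
            for b in range(len(edges) - 1) if np.any(idx == b)}
```

The "average R²" reported in Table 1 corresponds to the mean of the per-variant values, while the "overall R²" is a single correlation computed over all variants pooled together.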
Regarding average memory usage, the GRUD model consumed only 0.4389 GiB, nearly seven times less than Kojima (2.8215 GiB) (Table 2), on an Intel Core i9-9900K CPU @ 3.60 GHz × 16 with 62.7 GiB of memory, using a single thread. In addition, the running time of the GRUD model was 743 s, significantly shorter than Kojima's (35,082 s). It should also be noted that the running time of both methods depends on the libraries and computational processors used. In general, the GRUD model was faster and consumed fewer resources than the Kojima method with default settings.
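Measurements like those in Table 2 can be collected with a small wrapper such as the sketch below (our own helper, not part of GRUD; the example command line is hypothetical, and on Linux ru_maxrss reports the peak resident memory of child processes in KiB):

```python
import resource    # Unix-only; ru_maxrss is reported in KiB on Linux
import subprocess
import time

def profile_command(cmd):
    """Run an external imputation command and return (seconds, peak GiB)."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    peak_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak_kib / 1024 ** 2

# Hypothetical usage:
# seconds, gib = profile_command(["impute_tool", "--in", "chip.vcf.gz"])
```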

Low-pass whole genome sequencing
We employed GLIMPSE as a representative imputation and phasing method designed for low-coverage sequencing data, and compared its imputation accuracy against that of our model (Fig. 1C). In general, the performance of GRUD on low-pass data was better than that of GLIMPSE, particularly for variants with MAF > 0.02. The average R² per variant of GRUD and GLIMPSE was 0.818234 and 0.165236, respectively (Table 1).

Discussion
By applying deep learning to genotype imputation, we developed a new method, GRUD, to tackle the existing issues of other available methods. Our pre-trained model can be applied directly to new samples to impute genotypes, whereas HMM-based models have to re-estimate parameters from scratch whenever a new batch is given. We exploited a sequential layer able to learn non-linear combinations and to transform the input data toward a Gaussian distribution, which can facilitate the learning process and enhance the model's ability to mimic real values. Furthermore, the additional discriminator network serves as a supportive model during training: it provides feedback to the generator, improving accuracy and training time by detecting mistakes in the generated data for the generator to fix in the next iteration. Thanks to the exchange of information between the two networks, the GRUD model can converge quickly. In the testing phase, the generator is employed separately to predict genotypes. The Kojima model requires running three RNN models, including the Lower-MAF and Higher-MAF models whose outputs are combined by a Hybrid model to generate the final results, whereas GRUD runs only one RNN model; the computational time of GRUD is thereby reduced significantly.

Moreover, the imputation accuracy of the GRUD model was comparable with the Kojima model and HMM-based methods on the 1KGP, SG10K, and Vietnamese genomic datasets, which derive from different platforms and diverse origins. There was a tendency toward lower R² values for rare variants, especially those imputed by deep-learning-based models that do not take genetic background knowledge into account; the performance of the GRUD model therefore showed a pattern similar to that of the Kojima model. Despite the sequential layer improving running time, the imputation accuracy (R²) for variants with MAF ≤ 0.02 was lower than that of the Kojima model on 1KGP data. This is because there are fewer genotype differences between samples at these loci, making it difficult for the sequential layer to learn the features and update its weights. Nevertheless, the R² of the GRUD model is considerably higher than those of the other methods for common variants with MAF > 0.05, which account for 89,241 of 209,427 variants (42.6%) considered on chromosome 22 in the 1KGP phase 3 dataset. In addition to the 1KGP and SG10K data, the proposed model showed comparable or even superior results to other methods on two in-house Vietnamese datasets generated by two popular SNP microarrays from Illumina and Affymetrix.

Another important advantage of GRUD is its ability to perform efficiently on low-coverage sequencing data. Although the model was evaluated on data from only 21 samples of 1KGP, the results showed positive signals, particularly for high-frequency variants, which receive more attention in downstream analyses17. This application could reduce the cost of obtaining genotype data while still guaranteeing the accuracy of genetic information, which is valuable in clinical practice as well as in the implementation of precision medicine. Hence, GRUD might be a potential alternative to traditional methods for genotype imputation across multiple low-cost genotyping techniques.
Currently, the training input of the GRUD model must have the same shape and fixed positions (e.g., chromosome, start, end) as the testing input, which poses several challenges when inputs of different shapes are applied. Following the idea of the masked language model18,19, we plan to train a model on sequencing data in which a number of positions are randomly hidden during training, allowing the model to perform on data generated from different platforms. Besides, we also suggest applying transfer learning to genotype imputation20: a model trained for a certain task is used as the foundation for others. It would be possible to train the model with genotype data of one chromosome and then apply it to the remaining ones, which could reduce the number of epochs and potentially fit datasets with small sample sizes in the learning step, improving both training time and performance.
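As an illustration of this masking idea, the sketch below is our own assumption of how such masking could look, not the published training code; MASK_TOKEN and mask_rate are hypothetical:

```python
import numpy as np

MASK_TOKEN = -1  # hypothetical sentinel for a hidden position

def mask_genotypes(haplotypes, mask_rate=0.15, rng=None):
    """haplotypes: int array of shape (n_samples, n_variants) with 0/1 alleles.

    Returns (masked input, boolean mask of hidden positions). A model trained
    this way would be asked to reconstruct the original alleles only at the
    positions where the mask is True."""
    rng = rng or np.random.default_rng()
    mask = rng.random(haplotypes.shape) < mask_rate
    masked = haplotypes.copy()
    masked[mask] = MASK_TOKEN
    return masked, mask
```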

Related work
Generative adversarial networks (GANs). The GAN architecture, comprising two competing neural networks called the Generator (G) and the Discriminator (D), was first introduced in the field of computer vision. From random noise z, G generates fake data G(z); the second component, the D network, then makes an effort to differentiate real and fake data from each other. Both networks constantly compete: the generator attempts to fool the discriminator, the discriminator adapts to the new fake data, and this feedback is in turn utilized to enhance the generator, and so forth.
In recent years, a number of authors have proposed applying GANs to language models to address problems such as translation and summarization21,22. In the context of genotype imputation, considering the genome as a long string of data, GANs can potentially be used to impute unobserved variants by utilizing a haplotype reference panel, promoting the application of low-cost genotyping approaches such as SNP microarray chips and low-pass sequencing.

Recurrent neural network (RNN). The RNN is a popular algorithm for many applications on sequential or time-series data thanks to the internal memory in its architecture23. In short, an RNN has two inputs, the present and the recent past; consequently, the data sequence carries essential information about what will happen next. However, one primary problem of the traditional RNN is the difficulty of accessing information computed a long time ago, known as the vanishing or exploding gradient problem. Therefore, other types of RNN cells, such as the Long Short-Term Memory (LSTM)24 and the Gated Recurrent Unit (GRU)25, were developed to handle this problem. The gates of these two cell types are capable of learning which data in a sequence should be kept or ignored, so relevant information can be passed through a long chain of sequences to make predictions. In prior evaluations, thanks to its simpler structure, the GRU was shown to be more computationally efficient than the LSTM26, so we decided to use GRU cells for our generator model.

Specifically, the GRU cell consists of two main gates, the update gate $\Gamma_u$ and the reset gate $\Gamma_r$, which determine what information should be passed to the next state and are also able to create output simultaneously. The general formula of both gates is

$$\Gamma = \sigma(W x_t + U a_{t-1}),$$

where $W$ and $U$ are coefficients specific to each gate, $\sigma$ is the sigmoid function, $x_t$ is the input at timestep $t$, and $a_{t-1}$ is the hidden state of the previous timestep. The new memory uses the reset gate to store information related to the past:

$$a'_t = \tanh(W x_t + \Gamma_r \odot U a_{t-1}),$$

where $a'_t$ is the candidate memory, with values in the range $[-1, 1]$. In the final step, the output of the network is $a_t$, the vector containing all the information at time $t$, which is transmitted onward. To compute it, the update gate determines what to collect from the current memory $a'_t$ and from the information of the previous step $a_{t-1}$:

$$a_t = \Gamma_u \odot a'_t + (1 - \Gamma_u) \odot a_{t-1}.$$

Note that $\odot$ denotes element-wise multiplication between two vectors.
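To make these update rules concrete, here is a minimal NumPy sketch of a single GRU cell step (an illustration under the notation above, not GRUD's actual implementation; dimensions and random initialization are placeholders):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """A single GRU cell following the gate equations above."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        def w():  # small random (hidden_dim x input_dim) matrix
            return 0.1 * rng.standard_normal((hidden_dim, input_dim))
        def u():  # small random (hidden_dim x hidden_dim) matrix
            return 0.1 * rng.standard_normal((hidden_dim, hidden_dim))
        # One (W, U) pair per gate, plus one for the candidate memory a'_t.
        self.Wu, self.Uu = w(), u()   # update gate
        self.Wr, self.Ur = w(), u()   # reset gate
        self.Wa, self.Ua = w(), u()   # candidate memory

    def step(self, x_t, a_prev):
        gamma_u = sigmoid(self.Wu @ x_t + self.Uu @ a_prev)             # update gate
        gamma_r = sigmoid(self.Wr @ x_t + self.Ur @ a_prev)             # reset gate
        a_cand = np.tanh(self.Wa @ x_t + self.Ua @ (gamma_r * a_prev))  # a'_t in [-1, 1]
        return gamma_u * a_cand + (1.0 - gamma_u) * a_prev              # new state a_t
```

A bidirectional layer, as used in the generator, simply runs two such cells over the variants in opposite directions and concatenates their hidden states.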

Model
The genotype data can be split into two smaller groups: observed variants and unobserved variants. The observed variants are extracted from a genotyping array or sequenced by low-coverage sequencing, while the unobserved variants are not directly identified by genotyping methods and are predicted by imputation models. In the imputation process, we assumed that both the observed and unobserved variants were sorted according to their positions on the genome. Given that the GRU cells we used to construct the model have restricted memory usage, a chromosome was split into different regions, each containing approximately 1000 observed and unobserved variants in our experiments. Notably, observed variants in the upstream and downstream flanking sections of a region were also included in the input data for the imputation model; the imputation results from all regions are then concatenated in the final result (see the sketch below). We assume that both observed and unobserved variants are biallelic; in other words, their alleles are represented by binary values (0 for the reference (ref) allele and 1 for the alternative (alt) allele).

Our model was designed on the basis of the GAN architecture, comprising two components: a Generator and a Discriminator (Fig. 4). The generator is a deep network able to generate unobserved genotypes based on the observed variants given as input, and the discriminator tries to distinguish the generated genotype from a haplotype reference panel to determine whether it is a real or fake genotype. During the training process, the two networks compete with and improve each other at the same time, yielding better and better imputation accuracy. After training, the generator is separated from the discriminator to perform imputation on data from SNP microarray chips and low-pass sequencing.

Figure 4. An overview of the GRUD model comprising two components: Generator and Discriminator. The generator is a deep network potentially able to generate unobserved genotypes based on the input of observed variants, and the discriminator tries to distinguish the generated genotype from a haplotype reference panel to determine whether this is a real or fake genotype. RNN: recurrent neural network.
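As a schematic illustration of the windowing step described above (the ~1000-variant region size follows the text, but the flank length and function name are our assumptions):

```python
def split_into_regions(n_variants, region_size=1000, flank=100):
    """Cut a chromosome's sorted variant indices into regions.

    Yields (core, flanked) index ranges: the model imputes the core
    [start, end) but is fed the flanked window, so each region sees
    observed variants from its upstream and downstream context."""
    for start in range(0, n_variants, region_size):
        end = min(start + region_size, n_variants)
        flank_start = max(0, start - flank)
        flank_end = min(n_variants, end + flank)
        yield (start, end), (flank_start, flank_end)

# The per-region imputation results for the cores are then concatenated
# in genomic order to form the final chromosome-wide call set.
```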
In practice, the loss function for training the model over unobserved variants, presented in Eq. (4), is the sum of the generator loss $L_G$ and the discriminator loss $L_D$, both of which are binary cross-entropies (BCE) between the training and reference data:

$$L = L_G + L_D. \quad (4)$$
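A hedged NumPy sketch of this combined objective follows (our own illustration, not the authors' implementation; the averaged real/fake weighting inside $L_D$ is an assumption):

```python
import numpy as np

def bce(p, y, eps=1e-7):
    """Binary cross entropy between predicted probabilities p and targets y."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def grud_loss(gen_probs, true_alleles, disc_on_fake, disc_on_real):
    """Combined objective L = L_G + L_D.

    gen_probs: generator allele probabilities at unobserved variants.
    true_alleles: 0/1 alleles from the reference data.
    disc_on_fake / disc_on_real: discriminator 'real' probabilities on
    generated genotypes and on reference-panel haplotypes, respectively."""
    l_g = bce(gen_probs, true_alleles)                       # generator BCE
    l_d = 0.5 * (bce(disc_on_real, np.ones_like(disc_on_real))
                 + bce(disc_on_fake, np.zeros_like(disc_on_fake)))
    return l_g + l_d
```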
Table 1. R² of the correlation between true and imputed genotypes provided by GRUD and the other existing models, calculated on chromosome 22 variants. Overall R²: calculated over all variants on chromosome 22 without separation into MAF bins. Average R²: the mean of the R² values calculated per variant.

Figure 1. Accuracy assessment of the GRUD model on 1000 Genomes Project phase 3 (1KGP3) data compared to other existing methods. (A) Comparison of R² across MAF scales between different settings of the model, namely Higher, Lower, and Equivalent, on 1KGP3 chromosome 22 (Omni2.5). Higher: the model trained with higher priority on high MAF; Lower: the model trained with higher priority on low MAF; Equivalent: the model trained with no priority on MAF. (B) Comparison of R² values on a linear MAF scale between the proposed method, Minimac4, Kojima, and Beagle5.4 for 100 individuals of 1KGP3 on chromosome 22. (C) Comparison of R² values between the proposed method and GLIMPSE for 21 individuals on chromosome 22 of the 1KGP3 dataset. MAF: minor allele frequency; R²: squared Pearson correlation.

Figure 2. Accuracy assessment of the GRUD model on SG10K data compared to other existing methods. MAF: minor allele frequency; R²: squared Pearson correlation.

Figure 3. Accuracy assessment of the GRUD model on VN1K data compared to other existing methods. (A) Comparison of R² values between our method, Minimac4, and Beagle5.4 on chromosome 22 of the GSAv3 dataset (24 samples). (B) Comparison of R² values between our method, Minimac4, and Beagle5.4 on chromosome 22 of the APMRA dataset (94 samples). MAF: minor allele frequency; R²: squared Pearson correlation.

Figure 5. A detailed architecture of the GRUD model. (A) The structure of the generator model. The proposed model uses a sequential layer before passing bidirectional GRU cells for feature extraction. The line at the bottom illustrates a set of variants in their order on the genome, where observed variants are shown in blue rounded squares and unobserved ones in white. Forward and backward RNNs are built off the observed variants; the input feature of the forward and backward RNNs is $x_{v_i}$ for observed variant $v_i$. With $\tilde{i}$ the index of the closest observed variant, $F_i$ is the vector formed by concatenating the forward RNN's output for observed variant $v_i$ and the output of the backward RNN for observed variant $v_{i+1}$. $y_{u_i}$ is a binary variable representing the allele of unobserved variant $u_i$. (B) The structure of the discriminator model. Its input is $y_{u_1}, \ldots, y_{u_n}$, where $n$ is the number of unobserved variants and $y_{u_n}$ is the output of the softmax function in the generator.

[Table 1 body omitted: Average R² and Overall R² are reported for each method. Bold terms indicate the highest values in each column.]

Table 2. Running time and average memory usage of the GRUD and Kojima models on the 1KGP Infinium Omni2.5-8 BeadChip dataset. Bold terms indicate the lowest values in each row.